Initially, our focus was on the correlation between affordable housing and educational achievement. However, during the process of selecting appropriate educational proxies, we found ourselves delving into the factors behind the measurements of these proxies. As a result, we decided to seek the guidance of Professor Lesley Lavery, who specializes in public policies in education. We hope to gain a better understanding of the educational proxies currently in use, and to investigate whether there are additional factors that impact educational outcomes and potentially render the current proxies interchangeable or distinct from other measures.
Our dataset includes a range of school-related variables such as location details, funding, and aggregated scores in various subjects. Specifically, the score variables cover the overall grade-cohort-standardized achievement score, as well as scores in reading, science, and physical education.
Our data is aggregated from five separate datasets. We use data from the California Department of Education, Georgetown University, and the Educational Opportunity Project at Stanford University.
Science Testing Data Codebook
Our science test data is from the California Department of Education and covers the 2021-2022 school year. It comes from the California Science Test, which assesses three domains: Life Sciences, Physical Sciences, and Earth and Space Sciences.
English Testing Data Codebook
Our English Language Arts/Literacy data is also from the California Department of Education, specifically from 2022. It reports the proficiency level of each student group within each school.
Physical Education Data Codebook
Our PE data comes from the California Department of Education and covers the 2018-2019 school year. It reports proficiency on seven fitness test components for each grade within each school.
School Funding Data Codebook
Our school funding data is 2019-2020 data aggregated from different federal and state sources and compiled into the dataset we are using by Georgetown University researchers. It contains the state, local, and federal funding each school receives, metadata about the school such as enrollment, and data about the income levels of the school's students.
Educational Opportunity Project at Stanford University (SEDA) Covariate Codebook
The SEDA data we’re using contains school-level standardized academic achievement data across all Californian schools. The achievement scores are grade- and cohort-standardized against the NAEP standard, indicating whether the students in a particular school and grade level are meeting the national standard for their grade. For instance, if a school’s 4th-grade students score 3.5, they are lagging 0.5 points behind the national standard. The achievement estimates are calculated using Ordinary Least Squares (OLS) and Empirical Bayes (EB) techniques.
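To make the grade-cohort-standardized scale concrete, here is a minimal sketch of the interpretation above; the function name and variables are ours, not SEDA's:

```python
# Illustrative only: on SEDA's grade-standardized scale, the national
# expectation for grade g is a score of g, so the deviation from the
# national standard is simply (score - grade).
def seda_deviation(mean_score: float, grade: int) -> float:
    """Deviation of a school-grade's mean score from the national standard."""
    return mean_score - grade

# The example from the text: 4th graders scoring 3.5 are 0.5 below standard.
print(seda_deviation(3.5, 4))  # -0.5
```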
School Details
This dataset contains metadata on 10,629 California schools, including both the nationally used NCES ID and the California CDS code, which we use to join data from our different sources. It also contains each school's longitude and latitude, which has been very useful for EDA so far.
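The join logic can be sketched roughly as below; the column names (`ncessch`, `cds_code`, etc.) and the toy rows are assumptions for illustration, not the datasets' actual schema:

```python
import pandas as pd

# Hypothetical school-details table carrying both identifiers.
details = pd.DataFrame({
    "ncessch": ["060000101", "060000202"],
    "cds_code": ["01100170112607", "01100170123968"],
    "lat": [37.8, 37.7],
    "lon": [-122.3, -122.2],
})
# Funding-style data keyed by the federal NCES ID.
funding = pd.DataFrame({
    "ncessch": ["060000101", "060000202"],
    "per_pupil_total": [14500.0, 12200.0],
})
# SEDA-style data keyed by the state CDS code (one school missing here).
seda = pd.DataFrame({
    "cds_code": ["01100170112607"],
    "gcs_mean": [3.5],
})

# The details table acts as the spine; each source joins on whichever ID it uses.
merged = (
    details
    .merge(funding, on="ncessch", how="left")
    .merge(seda, on="cds_code", how="left")
)
print(merged.shape)  # one row per school, columns from all three sources
```

Left joins keep every school in the spine, so gaps in any one source show up as missing values rather than dropped rows.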
To address privacy concerns, the data is aggregated at the school level. However, we recognize that there may still be issues with bias in the data collection process and with data accuracy.
Regarding potential bias, the data is taken from various sources, and it is possible that each school may have different processes for collecting the data. With the exception of the Educational Opportunity Project at Stanford University, we do not have information on the number of students for whom the data is collected, nor the demographic makeup of those students. As a result, we acknowledge that there may be inherent biases in the data that we cannot control due to a lack of information.
Regarding data accuracy, as we do not have detailed documentation for all of the datasets, it is challenging to ascertain their accuracy. However, we have confidence in the reliability of the government and highly-credited sources from which the data originates. Given this, we consider these datasets to be our most reliable option at present.
Taken out of context, our analyses and graphics have the potential to negatively impact educational policy, particularly since we will be looking at demographics and funding data. We need to be careful and deliberate in our analysis in order to minimize harm.
Thu’s results
This map shows the deviation in SEDA scores from the national standard for each county, providing a broad overview of academic achievement levels across Californian counties. Overall, areas in cities or wealthier suburbs (Silicon Valley, Los Angeles, San Diego) perform above the national average, shown through brighter colors (yellow and green). On the other hand, areas with fewer schools (e.g., Inyo, a national forest area) perform below the national average, denoted by darker colors (blue and purple).
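The county-level figures behind such a map reduce to a simple aggregation of school-level deviations. A minimal sketch, with made-up numbers and assumed column names:

```python
import pandas as pd

# Illustrative school-level SEDA deviations with a county label.
# Real column names and values differ; this only sketches the aggregation.
schools = pd.DataFrame({
    "county": ["Santa Clara", "Santa Clara", "Inyo", "Inyo"],
    "gcs_deviation": [0.8, 0.4, -0.6, -0.2],  # score minus grade-level standard
})

# Mean deviation per county; these values would then be joined to county
# shapes and colored to produce the choropleth.
county_dev = schools.groupby("county")["gcs_deviation"].mean()
print(county_dev)
```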
Jeremy’s results
We examined only the schools for which we have complete data, i.e., the schools in the intersection of our datasets. Among these schools, across all metrics, per-pupil government spending (both state and federal) showed a negative correlation with performance. At first glance, this appears to be linked to increased per-pupil spending at schools with a significant percentage of students who are English learners, in the foster care system, or eligible for free/reduced lunch.
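The correlation check itself is straightforward; here is a minimal sketch on synthetic, purely illustrative numbers (not our data) that mimic the pattern described:

```python
import numpy as np

# Synthetic toy data: higher per-pupil spending paired with lower
# standardized achievement, as in the pattern we observed.
spending = np.array([9000.0, 11000.0, 13000.0, 15000.0, 17000.0])
achievement = np.array([0.9, 0.5, 0.1, -0.3, -0.6])

# Pearson correlation between spending and achievement.
r = np.corrcoef(spending, achievement)[0, 1]
print(round(r, 3))  # strongly negative on this toy data
```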
Nathaniel’s results
Spending per student varies significantly across the state. We thought that this might explain some of the funding-achievement relationship we had come across. After seeing that the same negative relationship existed to some degree in each locale, we decided that location, or at least type of area, was not the factor we were looking for. It is also worth noting that the ‘Rural’ and ‘Town’ areas seem to be underperforming for some reason.
We understand that funding for schools in California is based in part on the percent of students eligible for free and reduced lunch, the percent in the foster care system, and the percent that have limited English. So we decided to fit some simple models of funding and these variables. We then adjusted per-student spending based on the simplest model: funding ~ free and reduced lunch. This flattened the relationship between funding and achievement and flipped the relationship between percent Hispanic and spending.
The flattened relationship persists in each area.
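The adjustment described above can be sketched as a residualization: fit funding on the free/reduced-lunch percentage and keep what the model does not explain. The numbers below are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic toy data: percent of students eligible for free/reduced lunch
# and per-pupil funding, roughly increasing together as in California's
# funding formula.
frl_pct = np.array([10.0, 30.0, 50.0, 70.0, 90.0])
funding = np.array([10500.0, 11800.0, 13100.0, 14200.0, 15600.0])

# Ordinary least squares fit of funding ~ free and reduced lunch.
X = np.column_stack([np.ones_like(frl_pct), frl_pct])
beta, *_ = np.linalg.lstsq(X, funding, rcond=None)

# The residual is the "adjusted" spending metric: funding above or below
# what the FRL percentage alone would predict.
adjusted = funding - X @ beta
print(np.round(adjusted, 1))
```

With an intercept in the model, the residuals average to zero by construction, so the adjusted metric measures relative over- or under-funding rather than raw dollars.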
The percent of students eligible for free and reduced lunch directly influences spending in the state of California. However, in wealthy localities with strong property tax bases, funding often exceeds the amount allotted by the state of California. These areas also have a lower percent of students eligible for free and reduced lunch. So our model is not looking directly at the supplemental grant received for the percent of students eligible for free and reduced lunch. Nevertheless we think it provides a more accurate picture than the raw spending metric.
The adjustment is very rudimentary and would benefit from more investigation. We could probably use school location in conjunction with census data to get some picture of property taxes and local funding, but variation in local tax and funding structures could make this difficult.
Moving forward, we would love to:
Expand further on funding metrics and explore ways to adjust them in a way that they don’t give a false picture when taken out of context (Nathaniel has put in some work in this regard and we have a plan to accomplish this)
Identify more potential metrics through bivariate visualizations and adjust them so that we can put all of them together into a model that explains different education proxies
Tell a better story with missing data (for example, why certain data is missing, where is it from, where does the data in our different datasets overlap and where does it not, are there trends here or not)
Thu, Nathaniel, and Jeremy all contributed equally to this checkpoint. Specifically:
Thu was responsible for cleaning and standardizing school identifiers to merge all datasets together in long format. She also cleaned the data for, and visualized, the map displaying the deviation in SEDA scores from the national standard for each county, which provides a broad overview of academic achievement levels across Californian counties. Lastly, she consolidated the narration for the results and this write-up using Jeremy and Nathaniel’s inputs, and organized the slides for the intermediate presentation.
Nathaniel was responsible for cleaning and merging the dataset into a wide format, as well as creating several visualizations that helped us develop a narrative about funding for this checkpoint. Additionally, he proposed a new approach for modeling an adjusted funding metric, which will give us a more comprehensive understanding of the correlation between funding and academic achievement at the school level.
Jeremy was responsible for gathering and aggregating some of the initial datasets that Thu later merged with the others. He also conducted various analyses comparing how different educational proxies trend as school funding increases. Due to his extensive knowledge of California, he is the team’s primary result interpreter, which enables us to contextualize the outcomes and generate ideas for future steps. Furthermore, he explored different potential variables by cleaning the data for, and visualizing, various bivariate graphs.